-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: sql insert_data operate column-wise to avoid internals #33229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: sql insert_data operate column-wise to avoid internals #33229
Conversation
pandas/io/sql.py
Outdated
# object array of Timedeltas | ||
d = b.values.astype(object) | ||
|
||
for i in range(len(temp.columns)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the issue with using .items
that it can provide a frame back in case of duplicate labels? If so wonder if we shouldn't just update .items
to do what you have done here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean BlockManager.items? i dont see how thats relevant here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temp
is a DataFrame, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. this is just doing this column-wise instead of block-wise to avoid accessing ._data
|
||
for i in range(len(temp.columns)): | ||
ser = temp.iloc[:, i] | ||
vals = ser._values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we start with not using iloc for this pattern (that gives a lot of overhead).
Not that it will matter very much in this case, though, I suppose, since we are converting to object dtype below which will be more costly.
See eg the helper function you asked me to remove in 07372e3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, _ixs works here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although _ixs helps, the big win actually comes from avoiding creating a series if all you need are the values (as I did in the linked snippet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Until #33252 goes through, any objection to _ixs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with jeff on this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's exactly the goal of the helper function I am adding in #33252 to be used in situations where only the array values are needed, like here is the case.
So if we are adding such helper method, why not use it? (the reason I did the PR is because I want to see it used in several places, and many of those places will be outside of frame.py, eg in the ops code)
If the privateness of the method is a problem, let's discuss making it public. Although I personally think that is not needed right now (and we are using plenty of semi-private methods on DataFrame/Series outside of frame.py
already)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets revisit that once #33252 is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR certainly does not need to wait on #33252 being merged, but I still first want to have the discussion whether we find this an appropriate place to use that function here, once it exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should be using internal functions except in the internals. This is not internal by any stretch (would say that pandas.io is offlimits entirely)
|
||
for i in range(len(temp.columns)): | ||
ser = temp.iloc[:, i] | ||
vals = ser._values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like using private methods like this, what is wrong with .iloc here?
|
||
for i in range(len(temp.columns)): | ||
ser = temp.iloc[:, i] | ||
vals = ser._values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR certainly does not need to wait on #33252 being merged, but I still first want to have the discussion whether we find this an appropriate place to use that function here, once it exists.
pandas/io/sql.py
Outdated
d = b.values.astype(object) | ||
|
||
for i in range(len(temp.columns)): | ||
ser = temp.iloc[:, i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should just use .items() here no? which yields the name and column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're going to need i
below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hinted at this in my first comment but I do also agree it would be nice to use items
- we seem to special case it a lot (I've done so in groupby) I think just to work around the fact that it can return a DataFrame for duplicate labels
If we didn't do that maybe we could just enumerate over df.items()
and have one way of doing things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hinted at this in my first comment
Oh! you meant DataFrame.items! I thought you were referring to BlockManager.items, which matches DataFrame.columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes (with an enumerate on top if you need the i)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to use enumerate and .items
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. Only comment I have is that this probably will fail with duplicate column labels if not already tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame.items has a check for columns.is_unique; is that check not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks; misunderstanding on my end
Can you please explain a bit more why you don't want a method like Yes, we want to use as much public functions in the other pandas modules. But we do have a set of "private" methods that we use ourselves, that I would see as "internal developer APIs". For example, we also use Or, do you want to limit the use here because it is outside pandas/core? (initially, you commented you didn't want private DataFrame methods to be used outside of frame.py, but maybe you meant outside pandas/core, and within all of /core it is fine to use those? In which case the argument would only be about pandas.io, pandas.plotting, pandas.tseries, more or less) |
Can we please bike-shed this elsewhere? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm I think a nice cleanup
No description provided.